PyData Cluj-Napoca, meetup #13, 2020.05.28

Summary

  • What to look at

  • What distributions look like

  • How to measure

  • How to investigate

Monitoring

https://play.grafana.org/

Timelines

Histograms

Tivnan et al. - Price Discovery and the Accuracy of Consolidated Data Feeds in the U.S. Equity Markets, Journal of Risk and Financial Management, 2018 https://www.mdpi.com/1911-8074/11/4/73/htm

Quantile plots

Distribution tail plot

Chart by factors

Hasbrouck J., Saar G., Low-latency trading, Journal of Financial Markets, 2013 https://www.erim.eur.nl/fileadmin/erim_content/documents/Saar_Nov6.pdf

Take a step back

What are we measuring?

  • processing times
  • roundtrip times

(c) Andrei Pandele

Coordinated omission

Gil Tene - "How NOT to Measure Latency" https://www.youtube.com/watch?v=lJ8ydIuPFeU&t=15m50
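
A minimal sketch of the effect, under stated assumptions: a simulated client plans one request every 10 ms but cannot send the next request until the previous reply arrives. The function name `simulate`, the service times, and the 100 ms "slow" threshold are illustrative and not from the talk. A naive client times each request from the moment it was actually sent; a corrected measurement times it from the moment it should have been sent.

```python
def simulate(service_times, interval):
    """Simulate a blocking client that plans one request every `interval` seconds.

    Returns (naive, corrected) latency lists:
      naive     = reply time - actual send time
      corrected = reply time - intended send time
    """
    naive, corrected = [], []
    clock = 0.0  # time at which the client becomes free again
    for i, service in enumerate(service_times):
        intended = i * interval        # when the request was planned
        start = max(clock, intended)   # blocked until the previous reply arrives
        end = start + service
        naive.append(end - start)
        corrected.append(end - intended)
        clock = end
    return naive, corrected


# Hypothetical workload: 100 requests, all take 1 ms except one 1 s stall.
service_times = [0.001] * 100
service_times[10] = 1.0

naive, corrected = simulate(service_times, interval=0.010)
slow_naive = sum(t > 0.1 for t in naive)
slow_corrected = sum(t > 0.1 for t in corrected)
print(f"slower than 100 ms: naive={slow_naive}, corrected={slow_corrected}")
```

The naive view records a single slow sample, while the corrected view also counts every request that was queued behind the stall.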

Why not averages and few percentiles?

What does the 99th percentile mean?

Whatsapp (2018)

~65 billion messages per day
~500M daily active users
Assuming iid delays and the average number of requests per user (65 billion / 500M = 130 per day), 99th percentile slowness will affect 1 - 0.99^130 ≈ 73% of the users

Depending on the number of requests per client:
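
The same calculation can be sketched for any number of requests per client. This assumes iid delays, as above; the function name `fraction_affected` and the request counts in the loop are illustrative.

```python
def fraction_affected(requests_per_client, percentile=0.99):
    """Probability that a client sees at least one request slower than
    the given percentile, assuming independent, identically distributed delays."""
    return 1.0 - percentile ** requests_per_client


for n in (1, 10, 100, 130, 1000):
    print(f"{n:>5} requests -> {fraction_affected(n):.1%} of clients affected")
```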

Grafana dashboard again

Visualize the distribution

Distribution tails

Throughput

Comparing performance

Comparing messaging systems performance

https://bravenewgeek.com/tag/coordinated-omission/

What distributions look like

Example 1: processing times

What distributions look like

Compare distributions

How distributions arise

Examples of distributions

Types of distributions

Example 2: roundtrip times, single consumer, localhost, constant small processing time

Types of distributions

Example 3: roundtrip times, single consumer, internet (short distance), constant small processing time

Types of distributions

Example 4: roundtrip times, single consumer, internet (long distance), no processing time (instant reply)

Types of distributions

Example 5: roundtrip times, single consumer, localhost, heavy server-side calculations based on a random uniform input

How to measure

  • log everything (if possible)

  • HdrHistogram

  • add timestamps to packets and chain them

  • use high resolution timers (if suitable)

  • build time series with timestamps, counts, events
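
One possible way to add timestamps to a message and chain them, as a sketch only: the message layout (a dict with a `trail` list), the helper names, and the stage names are assumptions for illustration. `time.perf_counter_ns()` values are only comparable within one process; across machines a synchronized wall clock would be needed instead.

```python
import time


def stamp(message, stage):
    """Append (stage, high-resolution timestamp in ns) to the message trail."""
    message.setdefault("trail", []).append((stage, time.perf_counter_ns()))
    return message


def stage_durations(message):
    """Return the time in ns spent between each pair of consecutive stamps."""
    trail = message["trail"]
    return {
        f"{a}->{b}": t_b - t_a
        for (a, t_a), (b, t_b) in zip(trail, trail[1:])
    }


# Hypothetical pipeline stages.
msg = {"payload": "hello"}
for stage in ("received", "parsed", "processed", "sent"):
    stamp(msg, stage)

durations = stage_durations(msg)
for name, ns in durations.items():
    print(f"{name}: {ns} ns")
```

Logging the whole trail per message keeps the raw data, so per-stage distributions and time series can be built later.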

How to report latencies

What numbers mean

What clocks to use

https://www.python.org/dev/peps/pep-0418/
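
PEP 418 added `time.get_clock_info()`, `time.monotonic()` and `time.perf_counter()`. A short sketch for inspecting the clocks on the current platform; the reported resolution and properties depend on the operating system.

```python
import time

# Properties of the standard clocks on this platform.
for name in ("time", "monotonic", "perf_counter", "process_time"):
    info = time.get_clock_info(name)
    print(
        f"{name:<13} monotonic={info.monotonic!s:<5} "
        f"adjustable={info.adjustable!s:<5} resolution={info.resolution:.1e} s"
    )

# Measure an interval with a clock that never goes backwards.
start = time.perf_counter_ns()
sum(range(10_000))
elapsed_ns = time.perf_counter_ns() - start
print(f"elapsed: {elapsed_ns} ns")
```

`time.time()` can jump when the system clock is adjusted, so it suits timestamps shared between machines rather than interval measurements.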

Clock synchronization

  • measure on the same machine (when possible)

  • use time synchronization services (NTP, Chrony, etc.)

  • triangulate different sources

  • sample times with a ping-like service

  • account for clock desynchronization
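
A ping-like sampling service could estimate the offset between two clocks with the standard NTP-style calculation. This is a sketch under assumptions: the function names and sample values are hypothetical, and the estimate is only exact when the outbound and return paths take the same time.

```python
def offset_and_delay(t0, t1, t2, t3):
    """NTP-style estimate from one ping-like exchange.

    t0: client send time   (client clock)
    t1: server receive time (server clock)
    t2: server send time    (server clock)
    t3: client receive time (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0  # server clock minus client clock
    delay = (t3 - t0) - (t2 - t1)           # roundtrip time on the wire
    return offset, delay


def best_offset(samples):
    """Use the offset from the exchange with the smallest roundtrip delay,
    since it is the least affected by queueing and path asymmetry."""
    return min((offset_and_delay(*s) for s in samples), key=lambda od: od[1])[0]


# Hypothetical samples: the server clock runs 5 s ahead of the client clock.
samples = [
    (100.0, 105.01, 105.02, 100.03),  # symmetric path, 10 ms each way
    (200.0, 205.05, 205.06, 200.07),  # slow outbound leg biases the estimate
]
estimated = best_offset(samples)
print(f"estimated offset: {estimated:.3f} s")
```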

Investigate and improve

  • "how?": measure

  • compare distribution tail over time

  • know your throughput. improve. scale horizontally or collapse when not possible

  • offline profiling reveals only some of the problems; make sure you also have enough information from production logs

  • extract critical scenarios from prod logs. stress tests vs historical replays

Investigate and improve

  • your system is not a black box, you can debug it: find the factors that change your latencies

  • build time series with factors throughout the day

  • assess their importance

  • find the dimensions on which delays cluster

  • prioritize consumer tasks
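
Finding the dimensions on which delays cluster can start with percentiles per factor value. The sketch below uses synthetic data; the factor `size`, the latency model, and the helper name `percentiles_by` are assumptions for illustration only.

```python
import random
import statistics
from collections import defaultdict

# Synthetic records: latency depends on a hypothetical "size" factor.
rng = random.Random(0)
records = []
for _ in range(5000):
    size = rng.choice(["small", "large"])
    mean_ms = 2.0 if size == "small" else 20.0
    records.append({"size": size, "latency_ms": rng.expovariate(1.0 / mean_ms)})


def percentiles_by(records, factor):
    """Group latencies by one factor and report p50 and p99 per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[factor]].append(record["latency_ms"])
    summary = {}
    for key, values in groups.items():
        cuts = statistics.quantiles(values, n=100)  # 99 cut points
        summary[key] = {"p50": cuts[49], "p99": cuts[98], "count": len(values)}
    return summary


summary = percentiles_by(records, "size")
for key, stats in sorted(summary.items()):
    print(f"{key:<6} n={stats['count']:>5} "
          f"p50={stats['p50']:.2f} ms p99={stats['p99']:.2f} ms")
```

A factor matters when its groups show clearly different percentiles; repeating the grouping per time window gives the time series of factors throughout the day.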

Investigate and improve

  • sources of delay are stochastic processes; try to identify and understand them

  • split delays into multiple transport and processing stages

  • reduce the number of roundtrips needed to build a result

https://medium.com/@sachinkagarwal/public-cloud-inter-region-network-latency-as-heat-maps-134e22a5ff19

Investigate and improve

  • target the critical path and historical or theoretical cases of failure

  • check outliers; major failures may involve a chain of things that went wrong

  • multimodal distributions are a sign of multiple paths; try to identify the split factor

  • drifts from parametric distributions